Abstract

We propose a new approximation approach for solving a discrete Markov decision model (DMD) with a large state space. The DMD is a structural model for analyzing data obtained from agents making dynamic decisions; however, solving DMDs with a large number of discrete states is always difficult (and at times impossible) because of the huge computational cost. The number of states in a DMD increases exponentially as we introduce state variables, a phenomenon called "the curse of dimensionality." To overcome this problem, we propose a new approach, named the statistical least square temporal difference method (SLSTD), that can solve DMDs with a large state space at a low computational cost. The SLSTD solves the Bellman equation of a DMD with a high-dimensional variable by employing two approximation techniques. Experimentally, the SLSTD is faster and more accurate than existing methods and, in some cases, reduces the computation time by over 99 percent. We also show that the estimator of the parameter of interest obtained by the SLSTD is consistent and asymptotically normal.

JEL Classification: C63, D01.


1 Introduction 


A discrete Markov decision model (DMD), also known as a dynamic discrete choice model, is extensively used for analyzing the behavior of agents. The main advantage of the DMD approach is that it allows us to conduct counterfactual analysis, since the DMD can handle the dynamic decision making of the agents. The agents in the DMD observe their own state in each period and then decide on an action, taking into account future rewards and the transitions between states. The action of an agent is often formalized as a discrete choice, and we can analyze the characteristics of the agents by investigating the realized choices. The DMD was introduced by Rust (1987) and has found many applications in present-day work in econometrics, marketing science, transportation science, and dynamic game theory.

Since the estimator for the parameter of the DMD rarely has an analytical form, implementing the DMD approach requires numerical calculation. However, solving a DMD requires highly complex nonlinear computation, and the computational cost therefore restricts the flexibility of the DMD. As a result, many empirical studies suffer from this computational hurdle and are forced to reduce the size of their DMD. Accordingly, many methodological studies have suggested computational methods to remove the hurdle. Hotz and Miller (1993), Aguirregabiria and Mira (2002), Su and Judd (2012), and Dubé et al. (2012) proposed methods to solve the DMD with a single agent. Aguirregabiria and Mira (2007), Bajari et al. (2007), Pesendorfer and Schmidt-Dengler (2008), and Egesdal et al. (2015) proposed efficient methods to solve DMDs that include multiple agents and game structures.

"The curse of dimensionality," which refers to the exponential rise in the number of grid points in the state space of the DMD, is one such computational difficulty. When we increase the number of discretized state variables in the DMD, the number of grid points in the state space increases exponentially, which leads to an extreme rise in the computational cost. Furthermore, such large state spaces require a large amount of memory to store the numerical values, and in some instances ordinary computers are not up to the task.

Several methods have been suggested to handle the curse of dimensionality in the DMD. There are some general methods that can be applied to various DMDs. Keane and Wolpin (1997) and Imai and Keane (2004) proposed approximating the value function by a basis function approximation. A series estimation based on Judd (1996) is another way to handle the curse, and it is applicable to a wide range of DMDs. There are other methods for specific DMDs, which are specialized for specific topics. Hendel and Nevo (2006) and Gowrisankaran and Rysman (2012) solve specific DMDs for analyzing consumer choices. Another method is the Monte Carlo method by Rust (1997), which can solve any model with up to a certain number of dimensions.

Despite the rich literature on the curse of dimensionality, few methods achieve both generality and a sufficient cost reduction. Some methods, like those of Hendel and Nevo (2006) and Gowrisankaran and Rysman (2012), are valid only for a specific problem and are thus not applicable when we analyze other topics. On the other hand, general methods such as Keane and Wolpin (1997) lack a theoretical analysis that guarantees the result of the analysis, and their performance is not sufficient in some cases. (We discuss this in detail in Section 4.)

The purpose of this paper is to suggest a new computational method that is applicable to a wide range of DMDs and has sufficient computational and statistical performance. In this paper, we suggest a statistical least square temporal difference method (SLSTD) that can avoid the curse of dimensionality of the DMD. The SLSTD focuses on the Bellman equation of the DMD. Solving the Bellman equation is the source of the computational burden when the state space is large, and reducing the cost of handling the Bellman equation is a critical problem. The SLSTD simplifies the Bellman equation by applying a basis function approximation to a high-dimensional variable of the Bellman equation. Furthermore, the SLSTD employs a stochastic root-finding technique to solve the simplified Bellman equation. By the two techniques, we can substantially



accurate results of the parameter estimation. Second, the computation time of the SLSTD is nearly independent of the size of the state space, so the computation time remains small even when the state space is large. Third, the SLSTD also has an advantage in terms of computational memory. We also provide the asymptotic properties of the parameter estimator obtained by the SLSTD. Given some conditions on the smoothness of the model and the number of basis functions, the estimator is consistent and asymptotically normal.

The rest of the paper is organized as follows. Section 2 describes the fundamental structure of the DMD and the existing approximation methods for handling the curse of dimensionality. Section 3 introduces the SLSTD. Section 4 examines the performance of the SLSTD through numerical experiments. Section 5 presents the theoretical aspects of the SLSTD. Section 6 concludes. The proofs are collected in the Appendix.
2 Model and existing methods


2.1 Model 


The DMD is a statistical model for analyzing a sequence of discrete choices, and the purpose of the analysis is to estimate the parameters of the agents from their choices. In the DMD, the observed choices depend on the state the agent is in, and the actions of the agents determine the transitions between states. We derive the likelihood of the actions and estimate the parameters of interest by maximizing the likelihood.

We consider the DMD with discrete time and discrete state variables. DMDs are formulated as $(S, A, \Theta, u, P)$. $S$ is a state space with $p$ dimensions, and each dimension $j$ has $q_j$ states. The state space is then represented as $S = \{s_{11}, \dots, s_{1q_1}\} \times \dots \times \{s_{p1}, \dots, s_{pq_p}\}$, and $|S| = \prod_{j=1}^{p} q_j$. $A$ is an action space, and $\Theta$ is a parameter space. $u : S \times A \times \Theta \times \mathcal{E} \to \mathbb{R}$ is a reward function, and $\mathcal{E}$ is the space of stochastic factors. $P : S \times A \times S \to [0,1]$ is the transition probability between states, written $p(s' \mid s, a)$ below, where $s'$ represents the state in the next period.

There exist $n$ agents, and agent $i = 1, \dots, n$ observes its own state $s_{i,t} \in S$ in each period $t$. Agent $i$ then decides its own action $a_{i,t} \in A$ from the action space. Agent $i$ also privately observes $\epsilon_{i,t}$, an independent stochastic factor that is unobservable to the researchers. The state of agent $i$ evolves after the action $a_{i,t}$ has been taken. It is also assumed that the transition between states has the first-order Markov property.

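Denote by $\theta \in \Theta$ the parameter vector that describes the characteristics of all agents; $\theta$ is the parameter of interest for the researchers. We suppose that agent $i$ obtains the reward $u(s_{i,t}, a_{i,t}, \epsilon_{i,t}; \theta)$, which depends on the stochastic factor $\epsilon_{i,t}$ and the parameter vector $\theta$. At time $t$ and given the state $s_{i,t}$, agent $i$ maximizes the discounted sum of rewards, called the value function:

$$v(s_{i,t}; \theta) := \max_{\{a_{i,t'}\}_{t' \ge t}} E\left[ \sum_{t' \ge t} \beta^{t'-t} u(s_{i,t'}, a_{i,t'}, \epsilon_{i,t'}; \theta) \right], \tag{1}$$

where $\beta \in [0,1)$ is the discount factor and $s_0$ is the initial state. Here, $v(s; \theta) : S \times \Theta \to \mathbb{R}$ is a function, and we also define $V(s_{i,t}; \theta) := E[v(s_{i,t}; \theta)]$ and $V(\theta) = \{V(s; \theta)\}_{s \in S}$.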

To analyze the decision making of agent $i$, we consider the choice probability of the action in each period. Let $P(a \mid s_{i,t}; \theta, V(\theta))$ be the probability of choosing $a$ in state $s_{i,t}$ given the parameter $\theta$:

$$P(a \mid s_{i,t}; \theta, V(\theta)) := E\left[ \mathbf{1}\left( a = \arg\max_{a'} \left\{ u(s_{i,t}, a', \epsilon_{i,t}; \theta) + \beta E_{s' \sim p}[v(s'; \theta) \mid s_{i,t}, a'] \right\} \right) \right], \tag{2}$$

where $\mathbf{1}(\cdot)$ is an indicator function and the expectation $E_p[\cdot]$ is taken over the state transitions.

Since $A$ has finitely many elements, we can combine (1) and (2) and obtain the following equation, called the Bellman equation:

$$V(s_{i,t}; \theta) = \sum_{a \in A} P(a \mid s_{i,t}; \theta, V(\theta)) \left\{ U(s_{i,t}, a; \theta) + \beta \sum_{s' \in S} V(s'; \theta)\, p(s' \mid s_{i,t}, a) \right\}, \tag{3}$$

where $U(s_{i,t}, a; \theta)$ is the expectation of $u(s_{i,t}, a, \epsilon_{i,t}; \theta)$ given $a$. By solving the Bellman equation, we obtain the value of $V(\theta)$.

Suppose that we observe a sequence of state transitions and actions $\{(s_{i,t}, a_{i,t})\}$ for $i = 1, \dots, n$ and $t = 1, \dots, T$. The likelihood of the observed sequence is derived from the conditional choice probability (2), and we obtain the following log likelihood function:

$$\mathcal{L}(\theta, V(\theta)) = \frac{1}{nT} \sum_{i=1}^{n} \sum_{t=1}^{T} \left[ \log p(s_{i,t+1} \mid s_{i,t}, a_{i,t}) + \log P(a_{i,t} \mid s_{i,t}; \theta, V(\theta)) \right]. \tag{4}$$
By maximum likelihood estimation, we obtain the estimator $\hat{\theta} := \arg\max_{\theta} \mathcal{L}(\theta, V(\theta))$, where $V(\theta)$ satisfies the Bellman equation (3).
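As an illustration of how the log likelihood (4) is assembled from its two ingredients, the following minimal sketch averages the log transition probabilities and log choice probabilities over all observed triples. The function name `log_likelihood`, the callback signatures, and the toy inputs are hypothetical and not part of the paper; the choice probabilities are assumed to be supplied by some solver of the Bellman equation (3).

```python
import math


def log_likelihood(panel, transition_prob, choice_prob):
    """Average log likelihood in the spirit of equation (4).

    panel: list of agent histories; each history is a list of
           (state, action, next_state) triples (hypothetical layout).
    transition_prob(next_state, state, action) -> p(s'|s, a)
    choice_prob(action, state) -> P(a|s; theta, V(theta))
    """
    total = 0.0
    count = 0
    for history in panel:
        for state, action, next_state in history:
            total += math.log(transition_prob(next_state, state, action))
            total += math.log(choice_prob(action, state))
            count += 1
    # With n histories of common length T, count equals n * T.
    return total / count


# Hypothetical toy inputs: two agents, two periods, constant probabilities.
toy_panel = [[(0, 1, 1), (1, 0, 1)], [(1, 1, 0), (0, 0, 0)]]
toy_value = log_likelihood(
    toy_panel,
    transition_prob=lambda s_next, s, a: 0.5,
    choice_prob=lambda a, s: 0.25,
)
print(toy_value)
```

In an estimation routine, a function of this kind would be re-evaluated for each candidate $\theta$, with `choice_prob` recomputed from the corresponding solution of the Bellman equation.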


Remark. Practically, calculating the values of $U(s_{i,t}, a; \theta)$ and $P(a \mid s_{i,t}; \theta, V(\theta))$ requires tedious numerical integration. To avoid this computation, several assumptions on the functional and distributional form of $u(s, a, \epsilon; \theta)$ are often introduced.

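When we are allowed to assume that $u(s, a, \epsilon; \theta) = \bar{u}(s, a; \theta) + \epsilon_a$, where $\epsilon_a$ is a stochastic term that is i.i.d. with respect to action $a$ and time and follows the type-I extreme value distribution, we obtain the following simple form:

$$P(a \mid s; \theta, V(\theta)) = \frac{\exp\left( \bar{u}(s, a; \theta) + \beta E_p[V(s'; \theta) \mid s, a] \right)}{\sum_{a' \in A} \exp\left( \bar{u}(s, a'; \theta) + \beta E_p[V(s'; \theta) \mid s, a'] \right)}.$$

In this case, the Bellman equation (3) can be rewritten as

$$V(s; \theta) = \sum_{a \in A} P(a \mid s; \theta, V(\theta)) \left\{ \bar{u}(s, a; \theta) + E[\epsilon_a \mid s, a; \theta, V(\theta)] + \beta \sum_{s' \in S} p(s' \mid s, a) V(s'; \theta) \right\},$$

where $E[\epsilon_a \mid s, a; \theta, V(\theta)]$ represents the conditional expectation of $\epsilon_a$, i.e.,

$$E[\epsilon_a \mid s, a; \theta, V(\theta)] = \log \frac{\sum_{a' \in A} \exp\left( \bar{u}(s, a'; \theta) + \beta E_p[V(s'; \theta) \mid s, a'] \right)}{\exp\left( \bar{u}(s, a; \theta) + \beta E_p[V(s'; \theta) \mid s, a] \right)} + \gamma,$$

where $\gamma$ is Euler's constant. This form enables us to calculate $U(s_{i,t}, a; \theta)$ and $P(a \mid s_{i,t}; \theta, V(\theta))$ analytically.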
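To make the closed forms above concrete, here is a minimal sketch that solves the logit form of the Bellman equation on a tiny model by plain fixed-point iteration. This is ordinary value iteration over all states, not the method proposed in this paper, and the three-state, two-action model (`u_toy`, `p_toy`, discount factor 0.9) is a hypothetical example chosen only for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329


def choice_values(V, u, p, beta):
    """v[s][a] = u_bar(s, a) + beta * sum_{s'} p(s'|s, a) * V(s')."""
    n = len(V)
    return [[u[s][a] + beta * sum(p[s][a][t] * V[t] for t in range(n))
             for a in range(len(u[s]))] for s in range(n)]


def choice_probabilities(v):
    """Logit conditional choice probabilities P(a|s), computed stably."""
    probs = []
    for row in v:
        m = max(row)
        weights = [math.exp(x - m) for x in row]
        z = sum(weights)
        probs.append([x / z for x in weights])
    return probs


def bellman_update(V, u, p, beta):
    """Right-hand side of the Bellman equation in its logit form."""
    v = choice_values(V, u, p, beta)
    P = choice_probabilities(v)
    # E[eps_a | a chosen] = gamma - log P(a|s) under type-I extreme value errors.
    return [sum(P[s][a] * (v[s][a] + EULER_GAMMA - math.log(P[s][a]))
                for a in range(len(v[s]))) for s in range(len(V))]


def solve_bellman(u, p, beta, tol=1e-10, max_iter=10_000):
    """Iterate V <- T(V) until successive iterates differ by less than tol."""
    V = [0.0] * len(u)
    for _ in range(max_iter):
        V_new = bellman_update(V, u, p, beta)
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new
    raise RuntimeError("value iteration did not converge")


# Hypothetical toy model: action 0 stays put, action 1 moves to the next state.
u_toy = [[0.0, 1.0], [0.5, 0.2], [1.0, 0.0]]
p_toy = [[[1.0 if t == (s + a) % 3 else 0.0 for t in range(3)]
          for a in range(2)] for s in range(3)]
V_toy = solve_bellman(u_toy, p_toy, beta=0.9)
print(V_toy)
```

Because each update touches every state, the cost of this direct approach grows with $|S|$, which is the difficulty discussed in the next subsection.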


Rust (1987) introduced these assumptions, which drastically reduce the cost of the numerical integration needed to calculate transition probabilities. They are discussed in detail in Aguirregabiria and Mira (2002).


2.2 The Curse of Dimensionality and Existing Methods 


Solving the Bellman equation (3) is necessary to evaluate the log likelihood function in (4); however, it is quite difficult when the state space $S$ is large. Since the state variables are discretized, the Bellman equation (3) is regarded as an equation in an $|S|$-dimensional vector. Namely, let $V = (V(s))_{s \in S}^{\top} \in \mathbb{R}^{|S|}$ and rewrite the Bellman equation as $V = T(V)$, where $T$ is the right-hand side of (3). However, as the number of state variables $p$ increases, the number of states $|S| = \prod_{j=1}^{p} q_j$ increases exponentially in $p$, and solving the equation $V = T(V)$ requires huge computational time and cost. For instance, in the DMD for career decisions, if we allow each agent to possess 6 types of human capital, each accumulated for at most 40 years, then we obtain $|S| = 40^6 = 4{,}096{,}000{,}000$, and we have to solve an equation in a 4,096,000,000-dimensional vector. This requires a huge computational time; note also that ordinary laptops cannot hold such a high-dimensional vector in memory. Similar examples appear in the literature, for example, Hendel and Nevo (2006) for consumer choice and Egesdal et al. (2015) for discrete choice games.
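The arithmetic behind the career-decision example can be checked with a few lines. The sketch below computes $|S| = \prod_j q_j$ and the memory needed to store one value per state; the assumption of 8 bytes per value (a double-precision float) and the helper names are ours, not the paper's.

```python
from math import prod


def state_space_size(levels_per_dimension):
    """Number of grid points |S| = prod_j q_j of a discretized state space."""
    return prod(levels_per_dimension)


def memory_gigabytes(num_states, bytes_per_value=8):
    """Memory for one value per state, in GB (10^9 bytes); 8 bytes is assumed."""
    return num_states * bytes_per_value / 1e9


# Career-decision example from the text: 6 types of human capital, 40 levels each.
career_states = state_space_size([40] * 6)
career_memory = memory_gigabytes(career_states)
print(career_states)  # 4096000000
print(career_memory)  # 32.768
```

Adding a seventh dimension with 40 levels would multiply both numbers by 40, which is the exponential growth described above.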

Several methods exist for solving the DMD under the curse of dimensionality; they are of two types: general methods and problem-specific methods.


Keane and Wolpin (1994) suggested a method that can be applied to general types of DMDs. This method picks some states randomly in each period and estimates the coefficients of an interpolation function on the picked states as

$$E\left[ \max_{a} \{ u(s, a) + \beta E[V(s')] \} \right] \approx \varphi\left( \max_{a} E[u(s, a) + \beta E[V(s')]],\ \operatorname{mean}_{a} E[u(s, a) + \beta E[V(s')]] \right),$$

where $\varphi$ is the interpolation function. Though it is handy, this method has some faults.

First, the computation time increases rapidly. The method is mainly suited to simplifying the evaluation of the Bellman equation and is thus not good at reducing the computational cost that stems from the state space. Second, the theoretical framework of the method has not been sufficiently elaborated. Since its performance is supported only by numerical experiments, its theoretical properties, such as consistency and the size of biases, are unknown.

We also consider another general method, the sequential series estimation method, which can solve many DMDs. This method also picks states from the state space in each period and approximates $V(s; \theta)$ as

$$V(s_t; \theta) \approx \sum_{k} r_{\theta,t,k}\, \phi_k(s_t),$$

where $r_{\theta,t,k}$ is a weight and $\phi_k(s_t)$ is a basis function. Since the method approximates the value function in each period, it requires multiple approximations. Judd (1996) provides the idea of the series estimation, and this method applies the idea to DMDs. The method is useful and its convergence is theoretically guaranteed, but it has one limitation, which we discuss later. Rust (1997) also suggested a general Monte Carlo method that can calculate a value function without being affected by an increase in the number of dimensions. His method, though independent of the object of analysis, requires strong restrictions on the transition and the state space of the model.

Some problem-specific methods, such as those by Hendel and Nevo (2006) and Gowrisankaran and Rysman (2012), are designed for consumer choices. These methods display high performance in market analysis, but depend on the specific characteristics of the market and are not applicable to other DMDs.

Thus, while problem-specific methods are fast, they cannot solve other general problems, and while some general methods enjoy wide applicability, they are not computationally feasible.

Accordingly, there is a need for a general method that achieves computational feasibility.

Other popular methods for estimating the DMD include the conditional choice probability (CCP) method by Hotz and Miller (1993), the nested pseudo likelihood (NPL) method by Aguirregabiria and Mira (2002), and the mathematical programming with equilibrium constraints (MPEC) method by Su and Judd (2012). The SLSTD works under the curse of dimensionality, whereas these methods are designed to solve DMDs with relatively small state spaces.

As an example, consider the career decision model by Keane and Wolpin (1997), which has over 1 million states ($|S| > 1{,}000{,}000$). It is impossible to apply the MPEC and CCP methods to such a model, because these methods need to construct an $|S| \times |S|$ numerical matrix, which is not computationally feasible.


3 Proposed Method 


We introduce the SLSTD, which solves the Bellman equation (3) approximately at a low computational cost. The SLSTD mainly employs two techniques: (i) the functional approximation method and (ii) the stochastic approximation method. After solving the Bellman equation by the SLSTD, we describe (iii) the parameter estimation. The main idea of the SLSTD is based on the TD method by Sutton (1988) and the LSTD method by Bradtke and Barto (1996) and Nedić and Bertsekas (2003).

Preliminarily, we provide some notation. Let $\|f\|_S^2 := \sum_{s \in S} f(s)^2$. Let $\{\phi_j(s)\}_{j=1}^{\infty}$ be an orthonormal system in $L^2(\bar{S})$ with $\phi_j : \bar{S} \to \mathbb{R}$, where $\bar{S} \subset \mathbb{R}^p$ is the convex hull of $S$. Given $k \ge 1$, consider the vector-valued function $\phi(s) := (\phi_1(s), \dots, \phi_k(s))^{\top}$.

For brevity, we define a Bellman operator $\Gamma[\cdot]$ as

$$\Gamma[V(\theta)](s, a) := U(s, a; \theta) + \beta \sum_{s' \in S} V(s'; \theta)\, p(s' \mid s, a);$$

then the Bellman equation (3) can be rewritten as

$$V(s; \theta) = \sum_{a \in A} P(a \mid s; \theta, V(\theta))\, \Gamma[V(\theta)](s, a). \tag{5}$$

We also let $V^*(\theta)$ and $P^*(\theta, V^*(\theta))$ be a solution of the Bellman equation. Note that the pair $(V^*(\theta), P^*(\theta, V^*(\theta)))$ is proved to be unique for each $\theta$ by Rust et al. (2002).


3.1 Method Outline 

In this section, we provide an outline of the SLSTD. The purpose of the SLSTD is to solve the Bellman equation (5) (the simplified version of (3)). The SLSTD employs two approximation techniques: (i) the basis function approximation and (ii) the stochastic approximation method.


(i) Basis Functional Approximation of the Value Function: Given $\theta$, we approximate the value function $V(s; \theta)$ as

$$V(s; \theta) \approx \phi^{\top}(s) w_\theta = \sum_{j=1}^{k} \phi_j(s)\, w_{\theta,j}, \tag{6}$$

where $w_\theta \in \mathbb{R}^k$ is a vector of approximation weights $w_{\theta,j}$. Since the approximation is regarded as a projection of $V(s)$ onto the linear space spanned by $\{\phi_j(s)\}$, there exists a unique optimal weight $w_\theta^* := \arg\min_{w_\theta} \|\phi(\cdot)^{\top} w_\theta - V^*(\cdot; \theta)\|_S$ by the projection theorem. Note that $\min_{w} \|\phi(\cdot)^{\top} w - f(\cdot)\|_S$ converges to zero as $k \to \infty$ when $f$ is sufficiently smooth (see Tsybakov (2009)).


By the functional approximation (6), we can represent $V(s; \theta)$ (an $|S|$-dimensional vector) by $w_\theta$ (a $k$-dimensional vector) using the given orthonormal system. Since we set $k$ much smaller than $|S|$, we can avoid the high dimensionality of the Bellman equation.
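The dimension reduction in (6) can be illustrated with a small least-squares projection. In the sketch below, a function tabulated on 1,000 grid points is summarized by $k = 3$ weights. The target function, the grid, and the polynomial basis $(1, x, x^2)$ are hypothetical choices for illustration; the basis is not orthonormal, which changes the weights but not the fitted projection onto the same span.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def project_onto_basis(values, features):
    """Weights w minimizing sum_s (phi(s)^T w - V(s))^2 via the normal equations."""
    k = len(features[0])
    gram = [[sum(f[i] * f[j] for f in features) for j in range(k)]
            for i in range(k)]
    rhs = [sum(f[i] * v for f, v in zip(features, values)) for i in range(k)]
    return solve_linear(gram, rhs)


# Hypothetical example: |S| = 1000 grid points summarized by k = 3 weights.
grid = [s / 999 for s in range(1000)]
features = [[1.0, x, x * x] for x in grid]
values = [math.log(1.0 + x) for x in grid]
weights = project_onto_basis(values, features)
max_error = max(abs(sum(fi * wi for fi, wi in zip(f, weights)) - v)
                for f, v in zip(features, values))
print(weights, max_error)
```

Note that this projection still requires the full vector of values as input, so by itself it does not remove the need to evaluate all states.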

However, this alone is not enough to overcome the curse of dimensionality. Some problems remain: (a) we still need to obtain the $w_\theta$ that approximately solves the Bellman equation, (b) the reduction in computational cost is not sufficient, and (c) approximation errors accumulate. Problem (a) is especially critical. The solution of the Bellman equation should satisfy equation (6) for all $s \in S$. Thus, solving the Bellman equation by an ordinary method, such as Newton's method, requires evaluating equation (6) for all $s \in S$. However, as already discussed, $|S|$ is too large in some cases, so this requires a high computational cost. In the rest of this section, we introduce an additional method to solve problem (a). Problems (b) and (c) are discussed in Section 4.


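(ii) Stochastic Approximation for Obtaining $w_\theta$: To estimate the optimal weight $w_\theta^*$, we evaluate the goodness of the approximation for a given $w_\theta$. In view of $\sum_{a \in A} P(a \mid s; \theta, V(\theta)) = 1$, $V^*(\theta)$ satisfies $\sum_{a \in A} P(a \mid s; \theta, V(\theta)) \left[ V(s) - \Gamma[V(\cdot)](s, a) \right] = 0$ for all $s \in S$ given $\theta$. We then define a similar moment condition for $w_\theta$ as follows. We consider the minimization problem

$$\min_{w_\theta} \sum_{s \in D} \left[ \sum_{a \in A} P(a \mid s; \theta, \phi^{\top}(s) w_\theta) \left\{ \phi^{\top}(s) w_\theta - \Gamma[\phi^{\top}(\cdot) w_\theta](s, a) \right\} \right]^2, \tag{7}$$

where $D$ is a set of states $s \in S$ generated from the empirical distribution. Applying Lemma 6 in Tsitsiklis and Van Roy (1997) to the minimization problem, we obtain the weak form of the first-order condition of (7):

$$\sum_{s \in D} \phi(s) \left[ \sum_{a \in A} P(a \mid s; \theta, \phi^{\top}(s) w_\theta) \left\{ \phi^{\top}(s) w_\theta - \Gamma[\phi^{\top}(\cdot) w_\theta](s, a) \right\} \right] = 0. \tag{8}$$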
To solve (8), we use the stochastic approximation method in Benveniste et al. (2012). The stochastic approximation method is an algorithm for finding a root of an equation that is given in the form of an expectation. As a sequence of random variables generated from a probability distribution is observed one by one, the stochastic approximation updates the solution of the equation, and the sequence of solutions converges to the root. Here, we cite the theorem on the stochastic approximation method from Benveniste et al. (2012).

Theorem: Stochastic approximation (Benveniste et al. (2012), Theorem 17). Let $X$ be a random variable with transition probability $\Pi(x' \mid x)$ and denote $F(w) = E_\Pi[f(w, X)]$, where $E_\Pi$ is the expectation under the stationary distribution. Suppose that there exists a unique $w^*$ satisfying $F(w^*) = 0$. Further, consider a decreasing sequence $\{\eta_i\}_i$ with $\sum_{i=1}^{\infty} \eta_i = \infty$ and $\sum_{i=1}^{\infty} \eta_i^2 < \infty$. Suppose that

• $f(w, x)$ has an envelope function that is polynomial in $x$ and linear in $w$, and

• $(w - w^*)^{\top} E_\Pi[f(w, X)] < 0$ holds for all $w \neq w^*$.

If a sequence $\{w_i\}$ is generated by the iteration

$$w_{i+1} = w_i + \eta_i f(w_i, X_i),$$

then the sequence $\{w_i\}$ converges to $w^*$ with probability 1 as $i \to \infty$.$^1$

To apply the theorem, we set $X = s$, $f(w_\theta, X_i) = \phi(s) \sum_{a \in A} P(a \mid s; \theta, \phi^{\top}(s) w_\theta) \left\{ \Gamma[\phi^{\top}(\cdot) w_\theta](s, a) - \phi(s)^{\top} w_\theta \right\}$, and $\Pi(X_i, X_{i+1}) = p(s' \mid s, a) P(a \mid s; w_\theta)$. We check the assumptions in the Appendix. The algorithm based on the stochastic approximation method enables us to solve problem (8) without summing or integrating $\phi(s) \sum_{a \in A} P(a \mid s; \theta, \phi^{\top}(s) w_\theta) \left\{ \Gamma[\phi^{\top}(\cdot) w_\theta](s, a) - \phi(s)^{\top} w_\theta \right\}$ over all states in each step.
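The mechanics of the iteration $w_{i+1} = w_i + \eta_i f(w_i, X_i)$ can be seen in the simplest possible case. The sketch below is a toy instance, not the SLSTD itself: it takes $f(w, x) = x - w$, so the root of $F(w) = E[X] - w$ is the mean of $X$, uses i.i.d. draws (a special case of the Markov setting in the theorem), and sets $\eta_i = 1/i$. The function name and the simulated data are hypothetical.

```python
import random


def robbins_monro(f, samples, w0=0.0):
    """Iterate w_{i+1} = w_i + eta_i * f(w_i, X_i) with step size eta_i = 1/i."""
    w = w0
    for i, x in enumerate(samples, start=1):
        w += (1.0 / i) * f(w, x)
    return w


# Toy root-finding problem: F(w) = E[X] - w, whose unique root is E[X] = 2.
rng = random.Random(0)
draws = [rng.gauss(2.0, 1.0) for _ in range(20000)]
root = robbins_monro(lambda w, x: x - w, draws, w0=10.0)
sample_mean = sum(draws) / len(draws)
print(root, sample_mean)
```

With this particular step size the iterate coincides with the running sample mean, so the arbitrary starting value `w0` is forgotten after the first update; each step uses only one observation and never evaluates the expectation itself.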

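By the stochastic approximation method, we define the sequence $\{w_\theta^{(\ell)}\}_\ell$ by the following equation:

$$w_\theta^{(\ell+1)} = w_\theta^{(\ell)} + \eta_\ell\, \phi(s_\ell) \sum_{a \in A} P(a \mid s_\ell; \theta, \phi^{\top}(s_\ell) w_\theta^{(\ell)}) \left\{ \Gamma[\phi^{\top}(\cdot) w_\theta^{(\ell)}](s_\ell, a) - \phi(s_\ell)^{\top} w_\theta^{(\ell)} \right\}, \tag{9}$$

with a step size $\eta_\ell$ that satisfies $\sum_{\ell=1}^{\infty} \eta_\ell = \infty$ and $\sum_{\ell=1}^{\infty} \eta_\ell^2 < \infty$. The initial point of $w_\theta^{(\ell)}$ is arbitrary. The basic approach underlying the SLSTD is to modify the approximation parameter $w_\theta^{(\ell)}$ iteratively according to the temporal difference; this is why we refer to the method as a TD method.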

Based on the stochastic approximation method, we define the estimator of $w_\theta$ as the limit of the sequence (9):

$$\hat{w}_\theta := \lim_{\ell \to \infty} w_\theta^{(\ell)}.$$

Now, by the SLSTD approach, we obtain the estimator of the value function $V(s; \theta)$ as

$$\hat{V}(s; \theta) := \phi(s)^{\top} \hat{w}_\theta,$$

given $\theta$. Note that the estimation of $V(s; \theta)$ is carried out for each fixed $\theta$, and the estimator $\hat{V}(s; \theta)$ can differ for each $\theta$.

$^1$We set $U(w) = (w - w_\theta)^2$ and $\rho_n(w, X_i) = 0$ in Theorem 17 of Benveniste et al. (2012).
(iii) Parameter Estimation: Finally, we define the estimator of the parameter of interest by the SLSTD as

$$\hat{\theta} := \arg\max_{\theta} \mathcal{L}(\theta, \hat{V}(\theta)),$$

where $\mathcal{L}$ is defined in (4). Note that when we evaluate $\mathcal{L}(\theta, \hat{V}(\theta))$ at different values of $\theta$ during the optimization with respect to $\theta$, we have to rerun the SLSTD for each $\theta$. This looks costly at first glance; however, it is not a problem in practice. Details are shown in Section 4.

3.2 Implementation and Discussion 

To proceed with iteration (9), we need to prepare the sequence of states $\{s_\ell\}_\ell$. Since we use the set $D$ of states from the empirical distribution in (7), we use the state transitions from the observed data for the iteration. In other words, the SLSTD approach intensively minimizes the error of the Bellman equation (5) on the states that agents pass through frequently.

Consider $n$ agents, with the $i$th agent having a state transition of length $T$. We now have a set of observed state transitions $\{s_{i,t}\}$ for $i = 1, \dots, n$ and $t = 1, \dots, T$. First, we run iteration (9) on the state transition of one agent. When the first agent reaches a terminal state, we carry over $w$ and continue iteration (9) with the state transition of the next agent. We repeat the operation for all $n$ agents. Thus, the total number of observations is $N = nT$. This procedure fits the structure of data in econometrics, which contain many agents, in contrast to the ordinary TD method, which runs the iteration on one long state transition.

This operation has another interpretation. We unite the data of all agents as one agent's repeated actions to generate the decision sequence. If the model has a terminal state and the repetitive agent reaches it, the agent does not obtain any reward and goes back to the initial state with probability one. Then, the state transition becomes irreducible over all states, and thus we can regard the state transition as having a stationary distribution.

Practically, we do not have to generate the sequence $\{w_\theta^{(\ell)}\}$ until $\ell = N$. We can stop the iteration once we judge that the sequence has converged, namely, once $\|w_\theta^{(\ell+1)} - w_\theta^{(\ell)}\|$ falls below a sufficiently small predetermined tolerance level $\tau > 0$.

We now provide pseudocode for the SLSTD. Algorithm 1 shows how to solve the Bellman equation for a given parameter $\theta$. The sequence $\{\eta_\ell\}_\ell$ is predetermined, along with the initial approximation weight.

We now discuss the initial point of $w_\theta^{(\ell)}$ and the tuning parameter $\{\eta_\ell\}_\ell$. When the initial point is far from the limit, the convergence of $w_\theta^{(\ell)}$ becomes unstable. Thus, the initial point needs tuning when the solution is not stable. The selection of the step size is a more important problem. We use a step size $\eta_\ell$ that depends on positive constants $c_1$ and $c_2$ and satisfies $\sum_{\ell=1}^{\infty} \eta_\ell = \infty$ and $\sum_{\ell=1}^{\infty} \eta_\ell^2 < \infty$. When $c_1$ is too small or $c_2$ is too large, the step size becomes small and the solution is strongly affected by the initial point. Thus, selecting a proper step size is also necessary to obtain a stable solution.

We now discuss the choice of basis functions. Power series and B-splines are common choices. Of the two, B-splines provide better estimates: since the SLSTD evaluates the value function approximation locally, on the observed state transitions, B-spline functions provide a more precise approximation.
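The pieces of this section can be put together in a short sketch of iteration (9), run over the observed state sequences of all agents with $w$ carried over from one agent to the next and a convergence check after each pass through the data. Several choices here are assumptions made for illustration and are not taken from the paper: the type-I extreme value form of the reward, the step size $\eta_\ell = c_1 / (c_2 + \ell)$, the omission of terminal-state handling, the one-hot basis, and the three-state toy model and its paths.

```python
import math

EULER_GAMMA = 0.5772156649015329


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def bellman_target(w, s, phi, u_bar, transitions, actions, beta):
    """sum_a P(a|s) * Gamma[phi^T w](s, a), assuming type-I extreme value errors."""
    v = [u_bar(s, a) + beta * sum(q * dot(phi(t), w) for t, q in transitions(s, a))
         for a in actions]
    m = max(v)
    log_z = m + math.log(sum(math.exp(x - m) for x in v))
    # P(a|s) = exp(v_a - log_z) and E[eps_a | a chosen] = gamma - log P(a|s).
    return sum(math.exp(x - log_z) * (x + EULER_GAMMA - (x - log_z)) for x in v)


def slstd_weights(paths, phi, k, u_bar, transitions, actions, beta,
                  c1, c2, tol=1e-8, max_passes=100):
    """Run iteration (9) over all agents' observed states until w stabilizes."""
    w = [0.0] * k
    step = 1
    for _ in range(max_passes):
        w_before = list(w)
        for path in paths:          # one observed state sequence per agent
            for s in path:          # w is carried over between agents
                eta = c1 / (c2 + step)   # assumed step-size form
                f = phi(s)
                delta = bellman_target(w, s, phi, u_bar, transitions,
                                       actions, beta) - dot(f, w)
                w = [wi + eta * fi * delta for wi, fi in zip(w, f)]
                step += 1
        if max(abs(a - b) for a, b in zip(w, w_before)) < tol:
            break
    return w


# Hypothetical toy model: 3 states, action 0 stays, action 1 moves to the next state.
N_STATES = 3
REWARDS = [[0.0, 1.0], [0.5, 0.2], [1.0, 0.0]]


def one_hot(s):
    return [1.0 if j == s else 0.0 for j in range(N_STATES)]


def toy_reward(s, a):
    return REWARDS[s][a]


def toy_transitions(s, a):
    return [((s + a) % N_STATES, 1.0)]   # list of (next state, probability)


toy_paths = [[0, 1, 2] * 10 for _ in range(20)]   # 20 agents, 30 periods each
w_hat = slstd_weights(toy_paths, one_hot, N_STATES, toy_reward, toy_transitions,
                      (0, 1), beta=0.5, c1=100.0, c2=100.0)
print(w_hat)
```

The one-hot basis makes the approximation exact, which keeps the toy example easy to check; in the setting of this paper, `phi` would instead return $k \ll |S|$ basis function values, such as B-splines, and only the states appearing in the data are ever visited.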


Algorithm 1 SLSTD: derive $\hat{w}_\theta$ with given $\theta$

  Given $\theta$ and $\{\eta_\ell\}_\ell$ s.t. $\sum_{\ell=1}^{\infty} \eta_\ell = \infty$ and $\sum_{\ell=1}^{\infty} \eta_\ell^2 < \infty$
  Initialize $w_\theta^{(1)}$ arbitrarily
  Set $\ell = 1$
  loop
    for $i = 1, \dots, n$ do
      for $t = 1, \dots, T$ do (with $s = s_{i,t}$)
        if $s$ is the terminal state then
          $w_\theta^{(\ell+1)} \leftarrow w_\theta^{(\ell)} + \eta_\ell\, \phi(s) \{ -\phi^{\top}(s) w_\theta^{(\ell)} \}$
        else
          $w_\theta^{(\ell+1)} \leftarrow w_\theta^{(\ell)} + \eta_\ell\, \phi(s) \sum_{a \in A} P(a \mid s; \theta, \phi^{\top}(s) w_\theta^{(\ell)}) \left( \Gamma[\phi^{\top}(\cdot) w_\theta^{(\ell)}](s, a) - \phi^{\top}(s) w_\theta^{(\ell)} \right)$
        end if
        $\ell \leftarrow \ell + 1$
      end for
    end for
    if $\|w_\theta^{(\ell)} - w_\theta^{(\ell-1)}\| < \tau$ then
      break loop
    end if
  end loop


4 Numerical Experiment

4.1 Parameter Estimation

To compare the estimation accuracy of the methods, we conducted a numerical experiment using an empirical DMD. We consider the method of Keane and Wolpin (1994) (henceforth, KW), the sequential series estimation, and the SLSTD. We used a DMD for analyzing career decisions, which is a simplified version of Keane and Wolpin (1997). The DMD is a finite-horizon model with an adjustable number of state variables, number of actions, and time horizon.

We present the DMD used in our experiment; it is often used in labour economics, for example in Keane and Wolpin (1997). The model has $p$ state variables, and its terminal time is $T$. An element of the state space has the form $s = (s_1, s_2, s_3, s_4) \in S$: $s_1$ is age, $s_2$ is years of education, $s_3$ is work experience, and $s_4$ contains the choice in the previous period. The state space is constructed as $S = \{s_{11}, \dots, s_{1T}\} \times \{s_{21}, \dots, s_{2T}\} \times \{s_{31}, \dots, s_{3T}\} \times \{s_{41}, \dots, s_{43}\}$. We set the action space as $A = \{1, 2, 3\}$ and the reward function as

$$u(s, a; \theta) = \begin{cases} \theta_1 s_2, & \text{if } a = 1, \\ \theta_2 s_2 + \theta_3 s_3, & \text{if } a = 2, \\ \theta_4, & \text{if } a = 3. \end{cases}$$

The choice $a = 1$ is the decision to attend school, and it increases $s_1$ and $s_2$ by one. The choice $a = 2$ is the decision to work, and it increases $s_1$ and $s_3$ by one. The choice $a = 3$ is staying home, and it increases only $s_1$.
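The experimental model is small enough to write down directly. The following sketch enumerates the state space, evaluates the reward function, and applies the deterministic state updates described above. Indexing the levels as $1, \dots, T$, recording the current action in $s_4$ of the next state, and the function names are assumptions made for illustration; the sketch also does not cap the state variables at $T$.

```python
from itertools import product


def build_state_space(T):
    """All s = (s1, s2, s3, s4): age, education, work experience, previous choice."""
    levels = range(1, T + 1)            # assumed indexing of the T levels
    return list(product(levels, levels, levels, (1, 2, 3)))


def reward(s, a, theta):
    """Reward u(s, a; theta) of the experimental model."""
    _, s2, s3, _ = s
    theta1, theta2, theta3, theta4 = theta
    if a == 1:                          # schooling
        return theta1 * s2
    if a == 2:                          # working
        return theta2 * s2 + theta3 * s3
    if a == 3:                          # staying home
        return theta4
    raise ValueError("action must be 1, 2, or 3")


def next_state(s, a):
    """Deterministic update; the chosen action is stored as the previous choice."""
    s1, s2, s3, _ = s
    if a == 1:
        return (s1 + 1, s2 + 1, s3, a)
    if a == 2:
        return (s1 + 1, s2, s3 + 1, a)
    return (s1 + 1, s2, s3, a)


q = len(build_state_space(30))
print(q)  # 81000
```

For $T = 30$ the enumeration gives $30 \times 30 \times 30 \times 3 = 81{,}000$ states, the value of $q$ used in the experiment.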

We generate data with $n = 1{,}000$ agents and estimate the parameters from the generated data. $q = |S|$ represents the number of elements in the state space. When we set $T = 30$, we have $q = 81{,}000$ elements. We set the stochastic term $\epsilon$ to follow a type-I extreme value distribution. Since the state space of this model is not so large, we can also estimate $\theta$ without any approximation method (such as the SLSTD, the KW method, or the series representation) for comparison. For the SLSTD, we use B-spline functions as the basis functions. We use the same basis functions for the sequential series method, with 5 grid points in each dimension for the approximation. For the KW method, we provide 100 grid points in each period.

First, we compare the parameter estimation. For estimation, we use simulated numerical data generated from the model with the true parameter. We generate data with 1,000 agents and replicate the experiment 200 times. Table 1 and Table 2 show the results for different parameter sets. The tables also contain the computational time for solving the Bellman equation with one parameter set, and the squared error of the Bellman equation, which is the difference between the left-hand side and the right-hand side of (5). Here, $\Delta$ denotes the error of the Bellman equation. The values are the means of the estimators over the replications; the figures in parentheses are standard deviations.

The SLSTD yields a smaller estimation bias than the other methods in most cases. In some cases, the sequential series estimator performs better; this is because the model is linear and very simple, and is thus well suited to series estimation. However, when the model becomes larger or more complex, the sequential series method does not work well. In contrast, the standard deviations of the SLSTD are larger than those of the other methods. The KW method is unstable in most cases, and its bias becomes larger as the state space grows.

In terms of the Bellman error, the SLSTD performs best and the series method performs worst. In terms of computational cost, the KW method requires more computational time. The SLSTD and the sequential series method have a lower computational cost in Table 1 and Table 2. We investigate this point further in the following section.

4.2 Computational Cost 

Next, we present the results of experiments on the computational burden. We use the same model as in the estimation part and change the values of $p$ and $T$. The following results show the time needed to solve the Bellman equation for given true parameters. We repeat the same process 200 times.

Table 3 shows the results. The unit of time is seconds. We can see that the sequential series and KW methods cannot avoid the exponentially increasing computational burden. The burden is particularly severe for the KW method. In contrast, the computational cost of the SLSTD does not increase exponentially.

We provide some explanations for these results. First, the SLSTD evaluates the Bellman equation on fewer states. The SLSTD refers only to the states that are observed in the data and, owing to the stochastic approximation method, avoids checking other states. In contrast, the other methods need to refer to more states to evaluate the Bellman equation. Second, the SLSTD uses less memory, which lowers its computational cost. The sequential series and KW methods require a large memory to store the numerical values of $\{V(s; \theta)\}_{s \in S}$. When the memory in use is quite large, accessing the memory becomes much more costly. On the other hand, the SLSTD only stores information about the approximation weight $w_\theta$ and the basis functions. When the values at some states are required, the weight and basis functions are sufficient to recover $\{V(s; \theta)\}_{s \in S}$. This contributes to the computational advantage of the SLSTD.

Note also that the sequential series method fails to obtain the value when p = 5 and T = 30. 
This is because of the error that accumulates over sequential approximations. Because the 
method relies on backward induction, we have to approximate $\{V(s;\theta)\}_{s \in S}$ in each period. As 
$\{V(s;\theta)\}_{s \in S}$ is approximated repeatedly, the approximation error accumulates, and 
sometimes the accumulated error diverges. Figure 1 shows the error accumulation when we 
approximate the labor-economics DMD introduced earlier using the sequential series method. 
The horizontal axis is the time period, and the vertical axis shows the value of the 
approximated $V(s;\theta)$. We observe that the error accumulates and that its size rises 
exponentially with the repeated approximation. 
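
The mechanism can be illustrated with a stylized recursion. The sketch below is an assumption of ours, not the experiment behind Figure 1: it supposes that each backward-induction step multiplies the inherited error by a factor `gain` and adds a fresh approximation error `delta`. Both parameters are hypothetical.

```python
# Stylized model (an assumption, not the paper's experiment): each backward
# induction step re-approximates the value function, which amplifies the
# inherited error by `gain` and adds a fresh error `delta`.

def accumulated_error(periods, gain, delta):
    """Return the error after each of `periods` successive approximations."""
    errors = []
    err = 0.0
    for _ in range(periods):
        err = gain * err + delta
        errors.append(err)
    return errors

stable = accumulated_error(periods=27, gain=0.9, delta=1.0)
unstable = accumulated_error(periods=27, gain=2.0, delta=1.0)
print("gain < 1: error levels off at", round(stable[-1], 2))
print("gain > 1: error reaches", unstable[-1])
```

When the gain is below one the error stays bounded by delta / (1 - gain); when it exceeds one the error grows geometrically in the number of approximations, which is the qualitative behavior described above.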

When p = 5 and T = 40, the KW method and the sequential series method cannot yield any 
results, because the computation time is too long to obtain the estimates. 
Since the state space is very large in this case, the memory of a laptop 
cannot hold the numerical values of $\{V(s;\theta)\}_{s \in S}$ in the usual way; in other words, these 
two methods are not appropriate. In contrast, the SLSTD stores only the values of the weight 
$w_\theta$ and does not need a large memory to store the values of $\{V(s;\theta)\}_{s \in S}$. 

4.3 Discussion 

In this section, we discuss the advantages of the SLSTD. First, the SLSTD does not suffer from 
computer memory limitations. As mentioned earlier, the sequential series method and the KW 
method require a large memory to store $|S|$ numerical values, and it is at times impossible to 
store all values in the state space. For instance, when $|S| = 10^{10}$, the requirement is 80 GB of 
memory. In contrast, the SLSTD stores only the value of $w$, which can recover the value of $V(s;\theta)$ for 
all $s$. 
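
The memory arithmetic is easy to reproduce. The sketch below assumes one double-precision number (8 bytes) per stored value, under which 80 GB corresponds to 10^10 values; it also evaluates the state-space sizes listed for p = 5 in Table 3. The number of basis functions used for comparison (k = 100) is a hypothetical figure, and the function names are ours.

```python
BYTES_PER_VALUE = 8  # assumption: one double-precision float per value

def table_memory_gb(num_states, bytes_per_value=BYTES_PER_VALUE):
    """Memory in GB (10^9 bytes) needed to store one value per state."""
    return num_states * bytes_per_value / 1e9

def weight_memory_bytes(num_basis, bytes_per_value=BYTES_PER_VALUE):
    """Memory in bytes needed to store one weight per basis function."""
    return num_basis * bytes_per_value

# State-space sizes for p = 5 taken from Table 3, plus 10^10 states.
for size in (480_000, 2_430_000, 7_680_000, 10**10):
    print(f"|S| = {size:>14,}: {table_memory_gb(size):.4f} GB")

# Hypothetical number of basis functions k = 100, for comparison.
print("weights for k = 100:", weight_memory_bytes(100), "bytes")
```

A full table of values scales linearly with the number of states, whereas the weight vector scales only with the number of basis functions.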

The second advantage concerns the non-smoothness of DMDs. Figure 1 presents the 
accumulation of approximation error by the series method when the DMD has a non-smooth term. As 
the series method requires repeated approximation, the horizontal axis of Figure 1 is the number of 
approximations and the vertical axis is the size of the error of the Bellman equation. It is 
easy to see that the repeated approximation causes the error to accumulate. The sequential 
series method fails when $V(s;\theta)$ is not smooth enough. In contrast, the SLSTD can 
avoid the problem because it approximates $V(s;\theta)$ at once. In addition, the SLSTD 
admits a theoretical analysis of smoothness and estimation, which we provide in the 
following section. 

The third advantage arises when the method is applied to a non-logit type model. In many 
empirical studies, it is often required that the stochastic term $\epsilon$ follows a type-I extreme value 
distribution, so that the choice probability is represented in the multinomial logit form. If we use a 
non-logit type model, the derivation of the choice probability requires costly numerical integration. 
However, as shown earlier, the SLSTD reduces the number of times the choice probability needs to be 
derived. Thus, the SLSTD has a relative advantage when applied to a non-logit type model. 
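
The cost difference between the logit and non-logit cases can be seen in a small sketch. Under type-I extreme value shocks the choice probabilities have the closed multinomial logit form; for other shock distributions they must be computed numerically. Below, Monte Carlo simulation stands in for the numerical integration mentioned above, and the choice-specific values, the normal shock and all function names are hypothetical.

```python
import math
import random

def logit_probabilities(values):
    """Closed-form multinomial logit choice probabilities."""
    top = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def simulated_probabilities(values, draw_shock, draws, rng):
    """Monte Carlo choice frequencies for an arbitrary shock distribution."""
    counts = [0] * len(values)
    for _ in range(draws):
        utilities = [v + draw_shock(rng) for v in values]
        counts[utilities.index(max(utilities))] += 1
    return [c / draws for c in counts]

def gumbel_shock(rng):
    """Type-I extreme value draw via the inverse CDF."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def normal_shock(rng):
    """Standard normal draw, i.e. a non-logit (probit-type) shock."""
    return rng.gauss(0.0, 1.0)

values = [1.0, 0.5, 0.0]  # hypothetical choice-specific values
rng = random.Random(0)
print("logit, closed form:", logit_probabilities(values))
print("logit, simulated  :", simulated_probabilities(values, gumbel_shock, 20000, rng))
print("normal, simulated :", simulated_probabilities(values, normal_shock, 20000, rng))
```

The closed form costs a handful of exponentials per state, while the simulated version needs thousands of draws per state, so any method that evaluates choice probabilities at fewer states benefits more in the non-logit case.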


5 Theory 

In this section, we show the consistency of $V(s;\theta)$ provided by the SLSTD, and the asymptotic 
properties of the estimator $\hat\theta$. In this section, we define the norm $\|\cdot\|$ as the Euclidean norm, and 
consider $\|F(\cdot)\|_S = \sup_{s \in S} |F(s)|$. $\nabla_x$ represents partial differentiation with respect to $x$. 


5.1 Property of $V(s;\theta)$ 


As for evaluating $V(s;\theta)$, our theoretical results on the stochastic approximation part mainly 
depend on Tsitsiklis and Van Roy (1997) and Tagorti and Scherrer (2015). The error in the 
approximation by the basis functions comes from nonparametric series estimation, in the line 
of Newey (1997) and Andrews (1991). 

We also consider the following asymptotic setting: $q = |S|$ increases as $N$ increases. We 
write the setting as $q > CN$ for some $C > 0$. In empirical research, there is 
a correlation between the size of the state space and the number of observations. For example, in 
the DMD used in our experiments, $N$ and $q$ are correlated through the terminal time $T$. This setting 
reasonably describes the state spaces found in actual empirical research. 
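
The dependence of the state-space size on T can be read off the sizes reported in Table 3. The short check below copies the (p, T, |S|) triples from that table and confirms that they all match the pattern 3 T^(p-1). This pattern is our own observation from the tabulated numbers, not a formula stated in the text, and the names are illustrative.

```python
# (p, T, |S|) triples copied from Table 3.
table3_sizes = [
    (3, 20, 1200), (3, 30, 2700), (3, 40, 4800),
    (4, 20, 24000), (4, 30, 81000), (4, 40, 192000),
    (5, 20, 480000), (5, 30, 2430000), (5, 40, 7680000),
]

def implied_size(p, T):
    """Pattern read off the table (an observation, not a stated formula)."""
    return 3 * T ** (p - 1)

for p, T, size in table3_sizes:
    print(p, T, size, implied_size(p, T) == size)
```

Under this pattern the state space grows polynomially in T for fixed p and exponentially in p for fixed T, so a longer horizon enlarges both the number of observed transitions and the state space.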

To apply the theories, we assume the following conditions. $E_0[\cdot]$ denotes the expectation 
with respect to a stationary distribution. To evaluate the shape of $V(s;\theta)$, we define a new function on 
a continuous space, $V^* : S \times \Theta \to \mathbb{R}$, which satisfies $\sup_{s \in S} |V^*(s;\theta) - V(s;\theta)| = 0$ for all $\theta$.^1 


Assumption 1. Assume that 


1. $u(s,a,\epsilon;\theta)$ is bounded. 

2. $V^*(s;\theta)$ is $m$ times differentiable with respect to $s \in S$. 

^1 The existence of $V^*(s;\theta)$ is obvious. If $V^*(s;\theta)$ is not unique, we take the $V^*(s;\theta)$ with the maximum $m$. 


The following lemma provides the consistency of the stochastic approximation method. 

Lemma 1. If Assumption 1 holds, then for all $\theta \in \Theta$, we obtain with large probability, 

$$\|\phi^\top(s) w_\theta - V^*(s;\theta)\|_S = O_p\left(\frac{1}{\sqrt{N}}\log N + \frac{k}{\sqrt{N}} + k^{1-\frac{m}{p}}\right).$$ 

The proof is in Appendix B. Now we obtain the convergence rate of $V(s;\theta)$, which becomes an 
important factor in the following asymptotic analysis of the estimator. When $N, k \to \infty$, $k/\sqrt{N} \to 0$ 
and $m > 2p$, we obtain $\|\phi^\top(s) w_\theta - V^*(s;\theta)\|_S = o_p(1)$. 


5.2 Asymptotic properties of the estimation 

In this section, we provide the asymptotic normality of the estimator for $\theta$. Throughout this 
analysis, we regard $V(s;\theta)$ as a nuisance parameter and treat $\theta$ as the parameter of interest. 
The asymptotic result builds on Kosorok (2008), which analyzes the semiparametric 
M-estimator. 

First, we formalize the estimation problem. Since we observe $N = nT$ state transitions, 
we use $j = 1, \ldots, N$ as the index of the observations, and $i(j)$ and $t(j)$ are the corresponding indices. 
According to the empirical estimation equation, we rewrite the log likelihood function as 


l^Zj, 9,V[9)) . logp('Sj('j') I 

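where $Z_j \in \mathcal{Z} := S \times A \times S$. Then, $\mathcal{L}(\theta,V(\theta)) := \frac{1}{N}\sum_{j=1}^{N} l(Z_j;\theta,V(\theta))$ 
and $\mathcal{L}_0(\theta,V(\theta)) := E[l(Z;\theta,V(\theta))]$. We define the true parameter as 
$\theta_0 = \arg\max_\theta \mathcal{L}_0(\theta,V^*(\theta))$. 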

To show the asymptotic result, we consider the following assumptions, in the line of IL. 

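Assumption 2. Assume that 

1. $\theta_0$ is an interior point of the compact set $\Theta$, and is the unique maximizer of $\mathcal{L}_0(\theta,V^*(\theta))$. 

2. $\nabla^2_\theta \mathcal{L}_0(\theta,V^*(\theta))|_{\theta=\theta_0} + \nabla_\theta\nabla_V \mathcal{L}_0(\theta,V^*(\theta))|_{V(\theta)=V^*(\theta_0),\theta=\theta_0}$ exists and is non-singular. 

3. For all $s \in S$, $V^*(s;\theta)$ is Lipschitz continuous with respect to $\theta$. 

4. With some radius $\delta_N$, the class $\{l(Z;\theta,V(\theta)) : \mathcal{Z} \to \mathbb{R} \mid \theta \in \Theta, \|V(\theta) - V^*(\theta_0)\| < \delta_N\}$ 
is P-Donsker. 

5. $|\nabla\mathcal{L}(\theta,V(\theta)) - \nabla\mathcal{L}(\theta_0,V^*(\theta_0))| = O(\|\theta - \theta_0\|) + O(\|V(\theta) - V^*(\theta_0)\|)$ holds as $N \to \infty$, 
for all $\theta \in \Theta$ and $\|V(\theta) - V^*(\theta_0)\| < \delta_N$ with some radius $\delta_N$. 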


Assumption 2-1 is an identification condition for the true parameter $\theta_0$. Assumption 2-2 is 
a regularity condition for the estimation problem of DMDs. Both are generally assumed in 
asymptotic statistics (for example, see Van der Vaart (2000)). One may object that some 
empirical studies lack identifiability of the true parameter, so such cases require 
attention. Assumption 2-3 requires Lipschitz continuity of $V^*(s;\theta)$ in $\theta$. Assumption 2-4 is 
somewhat abstract; however, it covers a wide range of functions, such as the smooth, monotone, and 
Lipschitz continuous functions introduced in Van der Vaart (2000). When the reward function 
$u(s,a;\theta)$ and $p(s'|s,a)$ satisfy such properties, Assumption 2-4 holds. Assumption 2-5 
requires a kind of smoothness. Though it is a somewhat strong assumption, it is a general requirement 
in the literature on semiparametric statistics. 

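The following theorem provides asymptotic normality. 

Theorem 1. If Assumptions 1 and 2 hold, then 

$$\|\hat\theta - \theta_0\| \to 0,$$ 

as $N \to \infty$. Furthermore, if $\frac{k}{\sqrt{N}} + k^{1-\frac{m}{p}} = o(N^{-1/4})$ holds, then 

$$\sqrt{N}(\hat\theta - \theta_0) \xrightarrow{d} N(0, H_0^{-1} E[\Gamma_0\Gamma_0^\top] H_0^{-1}),$$ 

where 

$$\Gamma_0 = E\left[\left(\nabla_\theta l(Z;\theta,V^*(\theta))|_{\theta=\theta_0} + \nabla_V l(Z;\theta,V^*(\theta))|_{\theta=\theta_0}\right)^{\otimes 2}\right],$$ 

$$H_0 = \nabla^2_\theta \mathcal{L}_0(\theta,V^*(\theta))|_{\theta=\theta_0} + \nabla_\theta\nabla_V \mathcal{L}_0(\theta,V^*(\theta))|_{V(\theta)=V^*(\theta_0),\theta=\theta_0},$$ 

as $N \to \infty$. 
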


Here, $F^{\otimes 2}$ denotes $FF^\top$ for a vector $F$. The proof is provided in Appendix C. This 
theorem shows that the consistency of $\hat\theta$ is guaranteed in most cases. Moreover, when 
$V(s;\theta)$ is sufficiently smooth relative to the number of dimensions $p$, we obtain asymptotic 
normality of the estimator. For instance, letting $k, N \to \infty$ with $k$ growing slowly enough relative 
to $N$ and $m$ large enough relative to $p$ ensures that the condition holds. In the proof, we treat 
the estimation problem as a semiparametric M-estimation by regarding $V(s;\theta)$ as a nuisance 
parameter. Ichimura and Lee (2010) enables us to derive the asymptotic variance of the estimator. 


6 Conclusion 

We proposed a new approximation technique, the SLSTD, to solve discrete Markov decision 
models with a large state space. Because the curse of dimensionality makes the computational 
cost enormous, it hinders research that uses DMD models. We showed numerically 
that the SLSTD can approximate and solve the Bellman equation at a low computational cost. 
Further, the asymptotic theory guarantees that the SLSTD has good properties. 

^ The variance components $\Gamma_0$ and $H_0$ have analytical forms when $\epsilon$ has some parametric distribution. 


A Figures and Tables 


Method      p   T   |S|     $\theta_1$  $\theta_2$  $\theta_3$  $\theta_4$  Time (sec)  Bellman error
True param                  1.0         2.0         1.0         9.0
SLSTD       4   10  3000    0.97        2.49        1.16        5.89        0.44        2.2E+01
                            (0.17)      (0.23)      (0.17)      (0.31)      (0.003)     (2.9E+01)
Sequential  4   10  3000    1.06        2.03        0.99        8.91        0.09        3.0E+04
                            (0.04)      (0.11)      (0.02)      (0.18)      (0.001)     (4.1E+04)
KW          4   10  3000    -3.49       2.25        0.21        4.63        0.64        3.5E+02
                            (9.66)      (1.51)      (0.29)      (1.74)      (0.007)     (7.8E+02)
SLSTD       4   15  10125   0.97        2.49        1.16        5.89        0.69        1.6E+02
                            (0.17)      (0.23)      (0.17)      (0.31)      (0.004)     (2.5E+02)
Sequential  4   15  10125   -0.19       -0.03       -0.05       -5.71       0.16        4.8E+05
                            (0.12)      (0.11)      (0.13)      (7.19)      (0.017)     (7.1E+05)
KW          4   15  10125   4.46        3.86        -0.04       3.46        2.85        1.4E+03
                            (2.30)      (2.09)      (0.30)      (2.65)      (0.016)     (2.8E+03)
SLSTD       4   20  24000   0.58        0.88        0.64        8.13        0.95        1.3E+03
                            (0.70)      (0.81)      (0.62)      (6.44)      (0.007)     (2.3E+03)
Sequential  4   20  24000   0.47        0.36        0.46        3.19        0.40        1.7E+06
                            (0.24)      (0.14)      (0.07)      (5.31)      (0.001)     (3.0E+06)
KW          4   20  24000   4.76        3.59        1.46        5.19        9.62        2.5E+03
                            (0.45)      (0.34)      (0.14)      (0.66)      (0.137)     (5.3E+03)

Table 1: Estimation result 1: Parameter estimation with a parameter set, and computational 
time and Bellman error. We replicate 200 times with generated observations. Values in 
the table are the mean and standard deviation (in parentheses) of the estimator over the replications. 


Method      p   T   |S|     $\theta_1$  $\theta_2$  $\theta_3$  $\theta_4$  Time (sec)  Bellman error
True param                  2.0         3.0         2.0         12.0
SLSTD       4   10  3000    2.24        1.20        2.25        11.18       0.44        2.1E+02
                            (0.26)      (1.02)      (0.34)      (5.76)      (0.004)     (2.9E+02)
Sequential  4   10  3000    1.96        2.99        2.10        11.73       0.09        2.1E+05
                            (0.03)      (0.02)      (0.01)      (0.19)      (0.002)     (2.9E+05)
KW          4   10  3000    2.44        1.52        1.46        -0.06       0.64        1.6E+03
                            (0.47)      (0.29)      (0.23)      (1.01)      (0.018)     (3.6E+03)
SLSTD       4   15  10125   1.36        1.97        3.33        12.44       0.70        3.5E+03
                            (1.52)      (3.00)      (3.91)      (6.03)      (0.022)     (6.3E+03)
Sequential  4   15  10125   0.19        -0.45       1.32        10.18       0.16        2.7E+06
                            (1.10)      (3.10)      (3.40)      (15.33)     (0.001)     (4.7E+06)
KW          4   15  10125   9.46        6.53        -0.60       6.75        2.84        6.3E+03
                            (3.07)      (2.01)      (1.07)      (5.13)      (0.027)     (1.2E+04)
SLSTD       4   20  24000   1.81        2.78        1.85        11.06       0.96        1.3E+04
                            (0.65)      (0.74)      (0.50)      (3.19)      (0.042)     (2.8E+04)
Sequential  4   20  24000   0.55        0.93        0.98        12.32       0.40        9.0E+06
                            (0.51)      (1.02)      (1.01)      (0.08)      (0.032)     (1.9E+07)
KW          4   20  24000   10.39       3.71        4.57        6.17        9.61        1.1E+04
                            (0.28)      (0.52)      (1.04)      (6.10)      (0.069)     (2.5E+04)

Table 2: Estimation result 2: Parameter estimation with another parameter set, and computational 
time and Bellman error. We replicate 200 times with generated observations. Values in 
the table are the mean and standard deviation (in parentheses) of the estimator over the replications. 


[Figure 1 has two panels, "Error of Bellman equation" and "Error of Bellman equation (log)"; 
the horizontal axis of each panel runs from 1 to 27.] 

Figure 1: Error accumulation for the sequential series method 


Method      p   T   |S|       Time (mean)  Time (s.d.)
SLSTD       3   20  1200      0.92         0.07
Sequential  3   20  1200      0.04         0.01
KW          3   20  1200      0.40         0.04
SLSTD       3   30  2700      1.43         0.39
Sequential  3   30  2700      0.06         0.10
KW          3   30  2700      0.74         0.31
SLSTD       3   40  4800      2.88         0.83
Sequential  3   40  4800      0.16         0.02
KW          3   40  4800      1.77         0.16
SLSTD       4   20  24000     1.21         0.26
Sequential  4   20  24000     0.52         0.13
KW          4   20  24000     11.90        2.23
SLSTD       4   30  81000     1.67         0.23
Sequential  4   30  81000     1.15         0.18
KW          4   30  81000     66.90        7.61
SLSTD       4   40  192000    2.99         0.86
Sequential  4   40  192000    3.76         1.16
KW          4   40  192000    349.99       85.86
SLSTD       5   20  480000    1.06         0.03
Sequential  5   20  480000    7.93         0.02
KW          5   20  480000    3081.82      178.30
SLSTD       5   30  2430000   3.31         0.47
Sequential  5   30  2430000   69.55        6.82
KW          5   30  2430000   82351.47     12344.05
SLSTD       5   40  7680000   5.78         0.56
Sequential  5   40  7680000   n/a          n/a
KW          5   40  7680000   n/a          n/a

Table 3: Computation time (sec): Computational time to solve some DMDs with different 
sizes of state space. Values in the table are the mean and standard deviation of the time over 
the 200 replications; "n/a" marks runs that returned no result. 


B Proof of Lemma 1 


In this proof, we keep $\theta$ fixed and omit it from the notation. As mentioned before, $V^*(s)$ is the solution 
of the Bellman equation, and $\phi^\top(s)w$ is the approximated value obtained by the SLSTD method. 

To begin, the approximation error of $V(s)$ can be decomposed as 

$$\|\phi^\top(s)w - V^*(s)\|_S \leq \|\phi^\top(s)w - \Pi_\Phi(s)V^*(s)\|_S + \|\Pi_\Phi(s)V^*(s) - V^*(s)\|_S.$$ 

First, we consider the term $\|\phi^\top(s)w - \Pi_\Phi(s)V^*(s)\|_S$. To evaluate this error, we have to show 
the existence of the optimal approximation weight $w^*$. Theorem 1 of Rust et al. (2002) shows 
that the Bellman equation of DMD models has a unique fixed point solution. Then, Lemma 6 in 
Tsitsiklis and Van Roy (1997) guarantees the existence of an optimal $w^*$ that uniquely satisfies 

$$\phi^\top(s)w = \Pi_\Phi(s)\sum_a P(s|a;w)\,\Gamma[\phi^\top(\cdot)w](s,a), \quad \forall s.$$ 

Next, we show that the sequence of $w$ generated by the stochastic approximation method 
converges to $w^*$. To show this, we verify the conditions of Theorem 2 in Tsitsiklis and Van Roy 
(1997) and Theorem 17 in Benveniste et al. (2012). Because the Bellman operator of DMD 
models is a contraction mapping, we can apply Lemma 9 of Tsitsiklis and Van Roy (1997) and 
show that $(w - w^*)^\top E_0[\phi(s_t)(\Gamma[\phi^\top(\cdot)w_t](s,a) - \phi^\top(s_t)w_t)] < 0$. The existence of a stationary 
distribution is guaranteed by the combining of data. The compactness of $S$ satisfies the 
condition on the initial state. Then, we can apply the theory of Tsitsiklis and Van Roy 
(1997) and show that 

$$w \to w^*.$$ 

According to the discussion, we can evaluate the approximation error of the SLSTD. Tagorti and Scherrer 
(2015) provide a theory and show that, with a large probability, 

$$\|\phi^\top(s)w - \Pi_\Phi(s)V^*(s)\|_S = O\left(\frac{1}{\sqrt{N}}\log N\right).$$ 

Here, $N$ is the maximum number of iterations. 
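
The convergence of iteratively updated weights to a fixed point can be illustrated on a toy problem. The sketch below runs a generic TD(0) iteration with one-hot features on a deterministic two-state cycle; it is not the paper's algorithm, and the chain, rewards, discount factor and step size are all hypothetical. Because the toy chain has no noise, a constant step size suffices here, whereas the stochastic setting of Tsitsiklis and Van Roy (1997) uses diminishing step sizes.

```python
# Toy TD(0) iteration with tabular (one-hot) features on a deterministic
# two-state cycle. This is a generic illustration, not the paper's algorithm.

BETA = 0.5                      # discount factor (hypothetical)
REWARD = {0: 1.0, 1: 0.0}       # reward received when leaving each state
NEXT_STATE = {0: 1, 1: 0}       # deterministic transitions 0 -> 1 -> 0

def td0(num_iterations, step_size):
    """Run TD(0) and return the weight vector after the last update."""
    w = [0.0, 0.0]
    s = 0
    for _ in range(num_iterations):
        s_next = NEXT_STATE[s]
        td_error = REWARD[s] + BETA * w[s_next] - w[s]
        w[s] += step_size * td_error
        s = s_next
    return w

# Fixed point of V(0) = 1 + 0.5 V(1), V(1) = 0.5 V(0).
w_star = [4.0 / 3.0, 2.0 / 3.0]
for n in (10, 100, 2000):
    w = td0(n, step_size=0.1)
    gap = max(abs(a - b) for a, b in zip(w, w_star))
    print(f"iterations = {n:5d}, max |w - w*| = {gap:.2e}")
```

The gap to the fixed point shrinks as the number of iterations grows, which is the role played by the maximum number of iterations in the bound above.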

For the second term $\|\Pi_\Phi(s)V^*(s) - V^*(s)\|_S$, this is the error of the $L^2$ projection of $V^*(s)$ 
onto the linear space spanned by the basis functions. When the domain of the function is 
continuous, this error is equivalent to the error of a least squares series estimation, and Andrews 
(1991) and Newey (1997) provide the theoretical result for this estimation. By Assumption 
1, most conditions of Newey (1997) are satisfied. The rank condition of Newey (1997) is a critical 
condition. To discuss it, we denote by $S_N$ the set of states observed in 
the set of transitions. Since $q = |S|$ increases at least at order $N$, i.i.d. data generation yields 
$|S_N| = O_p(N)$. Hence we can treat the $N$ of Newey (1997) and the number of transitions as the same. 
Then, combining the resulting bound on the second term with the bound on the first term, we obtain 

$$\|\phi^\top(s)w - V^*(s)\|_S = O\left(\frac{1}{\sqrt{N}}\log N + \frac{k}{\sqrt{N}} + k^{1-\frac{m}{p}}\right).$$ 

Thus, we obtain Lemma 1. 


C Proof of Theorem 1 

Let $\zeta_N := \|\phi^\top(s)w_\theta - V^*(s)\|$. Then, Lemma 1 gives $\zeta_N = O\left(\frac{1}{\sqrt{N}}\log N + \frac{k}{\sqrt{N}} + k^{1-\frac{m}{p}}\right)$. 

Then, from the condition of Theorem 1, it is easy to show that $\zeta_N = o(N^{-1/4})$. 

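First, we show the consistency of $\hat\theta$. For this purpose, we show the stochastic 
equicontinuity of $\mathcal{L}(\theta,\hat V(\theta))$ in $\theta$. We evaluate 

$$\Pr\left(\sup_{\theta'}\sup_{\theta:\|\theta-\theta_0\|<\delta} |\mathcal{L}(\theta,\hat V(\theta)) - \mathcal{L}(\theta',\hat V(\theta'))| > \epsilon\right),$$ 

with some positive constants $\delta$ and $\epsilon$. We decompose the difference as 

$$\begin{aligned}
|\mathcal{L}(\theta,\hat V(\theta)) - \mathcal{L}(\theta',\hat V(\theta'))|
&\leq |\mathcal{L}(\theta,\hat V(\theta)) - \mathcal{L}(\theta',\hat V(\theta))| + |\mathcal{L}(\theta',\hat V(\theta)) - \mathcal{L}(\theta',\hat V(\theta'))| \\
&\leq C_N\|\theta-\theta'\| + |\mathcal{L}(\theta',\hat V(\theta)) - \mathcal{L}(\theta',\hat V(\theta'))|,
\end{aligned}$$ 

with some $C_N = O(1)$. The last inequality comes from Assumption 2. Before considering the 
second term, we note the following inequality: 

$$\begin{aligned}
\|\hat V(\theta) - \hat V(\theta')\|_S
&\leq \|\hat V(\theta) - V^*(\theta)\|_S + \|V^*(\theta) - V^*(\theta')\|_S + \|V^*(\theta') - \hat V(\theta')\|_S \\
&\leq \|V^*(\theta) - V^*(\theta')\|_S + o_p(1).
\end{aligned}$$ 

The last inequality is from Lemma 1. By the definition of $V(s;\theta)$, it is a weighted sum of 
the bounded utility function. Thus, we obtain 

$$\|V^*(\theta) - V^*(\theta')\|_S \leq C'_N\|\theta-\theta'\|,$$ 

where $C'_N = O(1)$. This inequality is also obtained from Assumption 1. From the discussion, 
we can show 

$$\|\hat V(\theta) - \hat V(\theta')\|_S \leq C'_N\|\theta-\theta'\| + o(1).$$ 

Assumption 2 enables us to show 

$$|\mathcal{L}(\theta',\hat V(\theta)) - \mathcal{L}(\theta',\hat V(\theta'))| \leq C''_N\|\theta-\theta'\| + o(1),$$ 

where $C''_N = O(1)$. 

Finally, we prove that for some $\delta > 0$, there exists $\eta > 0$ and 

$$\begin{aligned}
&\lim_{\delta\to 0}\limsup_{N\to\infty}\Pr\left(\sup_{\theta'}\sup_{\theta:\|\theta-\theta_0\|<\delta} |\mathcal{L}(\theta,\hat V(\theta)) - \mathcal{L}(\theta',\hat V(\theta'))| > \epsilon\right) \\
&\leq \lim_{\delta\to 0}\limsup_{N\to\infty}\Pr\left(\sup_{\theta'}\sup_{\theta:\|\theta-\theta_0\|<\delta} C_N\|\theta-\theta'\| + C''_N\|\theta-\theta'\| + o(1) > \epsilon\right) < \eta.
\end{aligned}$$ 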


Thus, we obtain the stochastic equicontinuity of $\mathcal{L}(\theta,\hat V(\theta))$. Using Assumption 2 and 
a theorem in Van der Vaart (2000), we can show the consistency $\|\hat\theta - \theta_0\| \to 0$ as $N \to \infty$. 

To show the convergence rate and asymptotic normality, we apply the theorem for the 
semiparametric M-estimator in Kosorok (2008). Since we obtain $\zeta_N = o(N^{-1/4})$, Assumption 2-5 
is sufficient to make $\mathcal{L}$ smooth. Assumption 2 and the above result on consistency 
satisfy the assumptions of Theorem 21.1 in Kosorok (2008). Then the result holds. 